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Measuring the impact of scientific articles is important for eval¬ 
uating the research output of individual scientists, academic insti¬ 
tutions and journals. While citations are raw data for constructing 
impact measures, there exist biases and potential issues if factors af¬ 
fecting citation patterns are not properly accounted for. In this work, 
we address the problem of field variation and introduce an article level 
metric useful for evaluating individual articles’ visibility. This mea¬ 
sure derives from joint probabilistic modeling of the content in the ar¬ 
ticles and the citations amongst them using latent Dirichlet allocation 
(LDA) and the mixed membership stochastic blockmodel (MMSB). 
Our proposed model provides a visibility metric for individual articles 
adjusted for field variation in citation rates, a structural understand¬ 
ing of citation behavior in different fields, and article recommenda¬ 
tions which take into account article visibility and citation patterns. 

We develop an efficient algorithm for model fitting using variational 
methods. To scale up to large networks, we develop an online variant 
using stochastic gradient methods and case-control likelihood approx¬ 
imation. We apply our methods to the benchmark KDD Cup 2003 
dataset with approximately 30,000 high energy physics papers. 


1. Introduction. Measuring the impact and influence of scientific ar¬ 
ticles is important for evaluating the work of individual scientists (Hirsch, 
2005; Abramo and D’Angelo, 2011) and comparing journals (Garfield, 2006; 
Moed, 2010). For researchers, such information is one of the most consid¬ 
ered factors for hiring, promotion, funding decisions, award consideration 
and professional recognition. For academic journals, it is an indicator of 
a journal’s stature among its peers, which is valuable for various reasons, 
from being considered by prospective authors for paper submission to being 
sought after by readers who need authoritative opinions on a topic. 

Due to the lack of a unified definition, the quality and importance of sci¬ 
entific articles are often judged based on the journal in which they are pub¬ 
lished (Simons, 2008). In particular, a journal’s impact factor (see Garfield, 
2006), defined using a journal’s average number of citations per article, is 
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frequently used as an indicator of the “quality” of its articles and a means 
of evaluating the research output of individuals and institutions (Casade- 
vall and Fang, 2014). However, studies have shown that such usage can be 
misleading; the journal impact factor conceals differences in citation rates 
among articles, is research field dependent and does not measure the scien¬ 
tific quality of individual articles (Seglen, 1997). To improve the assessment 
of scientific research by academic institutions, funding agencies and other 
parties, the San Francisco declaration on research assessment 1 recommends 
(among other proposals) placing greater emphasis on the scientific content 
of an article rather than the journal impact factor (see Alberts, 2013). An 
increasing number of publishers and organizations are also providing article 
level metrics (Neylon and Wu, 2009; Fenner, 2014) to enable users to gauge 
the impact of articles based on their own merits. These new indicators in¬ 
clude data on usage activity, bookmarks (e.g. CiteULike and Mendeley) and 
discussions/recommendations on the Social Web (e.g. Twitter, Facebook, 
Blogs) in addition to citations. 

Citations (and other reference counts alike) are raw data for constructing 
measures to evaluate the impact of scientific articles. The h-index (Hirsch, 
2005), for instance, attempts to measure the impact of an author’s published 
work using citations (a researcher who has published h papers each having 
at least h citations has index h ). However, there exist biases and potential 
issues in using raw citations to compare the impact of scientific articles with¬ 
out accounting for other factors which may affect citation patterns. These 
factors include time from publication, journal profile, article type and so¬ 
cial network of authors (Bornmann and Daniel, 2008). A well-known and 
highly relevant factor is the variation in citation practices among different 
disciplines (Garfield, 1979). Articles in certain disciplines (e.g. Social Science 
and Mathematics) are typically much less cited than others (e.g. Molecular 
Biology and Immunology) and comparing articles using raw citation counts 
would be inappropriate. To address this issue, different procedures of nor¬ 
malizing the citation counts with respect to some reference standard have 
been proposed (e.g. Schubert and Braun, 1996; Vinkler, 2003; Radicchi, For- 
tunatoa and Castellano, 2008). More recently, Crespo, Li and Ruiz-Castillo 
(2013) and Crespo et al. (2013) consider a model where the number of ci¬ 
tations received by an article depends on the subfield to which the article 
belongs and the scientific influence of the article in the subheld. Their model 
assumes that citation impact varies monotonically with scientific influence. 

In this work, we introduce an article level measure (of citation likelihood) 

1 Outcome of a gathering of scientists at the Annual meeting of the American Society 
for Cell Biology on December 2012 
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that accounts for the variation in citation practices in different fields and is 
potentially useful for evaluating the impact of scientific articles. This mea¬ 
sure, named as topic-adjusted visibility metric , derives from joint probabilis¬ 
tic modeling of the content (text) in the articles and the citations (links) 
amongst them. We consider a framework whereby the connectivity of an 
article in a citation network depends on (1) the citation probability of the 
research fields (topics) that it belongs to and (2) its visibility to articles 
that are in a position to cite it. Our motivation is that while a citation is 
driven primarily by compatibility in research topics, the decision to cite an 
article over other equally relevant ones may be due to a complex mixture 
of attributes of the selected article which are unobserved/hard to quantify 
(e.g. research value and quality, profile of authors, journal readership) or 
difficult to model directly (e.g. time since publication, article type). Here we 
use the term visibility to capture collectively attributes of the article apart 
from research topics that accounts for its connectivity. In our model, the 
topics are discovered using only the text and connectivity information. It 
does not take into account the discipline classification of the article by the 
journal. We also use the term “field” to refer to a particular topic (area). 

As citation networks are a type of relational data where content infor¬ 
mation is available on individual nodes, our proposed model combines two 
well-established models (for text and relational data respectively): latent 
Dirichlet allocation (LDA, Blei, Ng and Jordan, 2003) and the mixed mem¬ 
bership stochastic blockmodel (MMSB, Airoldi et al, 2008). LDA is a gen¬ 
erative probabilistic model which can uncover research topics from the text 
of scientific articles, while the MMSB can detect communities within the ci¬ 
tation network and model inter- and intra-community citation probabilities. 
As the communities detected in citation networks often correlate well with 
research topics (Chen and Redner, 2010), these two models can be integrated 
by identifying the communities in MMSB with the topics in LDA (Pairwise- 
Link-LDA, Nallapati et al, 2008). We further introduce a latent variable at 
the article level into the MMSB, which scales the probability of a citation 
due to compatibility in research topics and acts as a measure of the visibility 
of individual articles. The proposed model provides a structural understand¬ 
ing of the field variation in citation behavior and a measure of visibility for 
individual articles adjusted for citation probabilities within/between topics. 

Our model can also provide article recommendations which take into ac¬ 
count individual articles’ visibility and citation patterns across different top¬ 
ics. Consider a scenario where one is searching for papers on a computational 
technique applied in multiple topic areas by using keywords. A method which 
sorts relevant articles by citation counts may yield a list where papers in 
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topics with higher citation rates are overrepresented at the top. We avoid 
this scenario as articles are recommended based on the citation probabil¬ 
ity within/across topics as well as the visibility metric which has adjusted 
for field variation in citation behavior. Hence, high impact articles in top¬ 
ics with low citation rates will not be overlooked. As the MMSB is able to 
capture both inter- and intra-topic citation probabilities, relevant articles 
which integrate multiple topics can also be identified. 

The proposed topic-adjusted visibility metric is novel and differs from 
approaches based on normalization of citation counts. While similar in mo¬ 
tivation with Crespo, Li and Ruiz-Castillo (2013) and Crespo et al. (2013), 
our model is significantly different from theirs. First, they do not consider 
the text of articles and make use of an external system provided by Thomson 
Reuters for classification (which may be limited in range) while we identify 
research topics in the articles using LDA and MMSB jointly. Moreover, our 
model does not assume that citation counts vary monotonically with visibil¬ 
ity within each topic and is fully generative for text and citations. Besides 
citations, other reference data (e.g. usage activity) which are field dependent 
can also benefit from our proposed framework. 

For model fitting, we adopt a Bayesian approach and develop efficient 
variational methods (Jordan et ai, 1999) for fast approximate posterior 
inference. As real-world citation networks are often massive and the com¬ 
putational cost of analyzing every pairwise interaction in the MMSB scales 
as the square of the number of nodes, we develop an online variant of our 
variational algorithm by subsampling the full network using case-control like¬ 
lihood approximation techniques (Raftery et ai, 2012) and stochastic vari¬ 
ational inference (Hoffman et al., 2013). Previously, stochastic variational 
inference has been employed successfully for LDA (Hoffman, Blei and Bach, 
2010) and the hierachical Dirichlet process (Wang, Paisley and Blei, 2011). 
At each iteration, it subsamples the data and optimizes the variational ob¬ 
jective using stochastic approximation methods (Robbins and Monro, 1951), 
thus reducing both computational and storage costs. Recently, Gopalan and 
Blei (2013) extend stochastic variational inference to massive networks for 
detecting overlapping communities by using a variant of the MMSB. They 
sample node pairs using “informative set sampling”, where the sets of pairs 
are defined using network topology information. A related idea is the strati¬ 
fied sampling scheme for MCMC estimation of latent space models (Raftery 
et al., 2012), where stratums are defined by shortest path lengths. Adapting 
case-control designs in epidemiology, they approximate the log-likelihood 
function by sampling all links for each node and only a small proportion of 
nonlinks from each stratum. This approach is feasible as large networks are 
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often sparse. It assumes that “closer” nodes contain more information and 
are more relevant in estimating each other’s latent position. Motivated by 
these methods, we propose a novel strategy for sampling node pairs that is 
suitably adapted to our model requirements. 

We apply our methods to the Cora dataset with 2410 scientific publi¬ 
cations in computer science research and the benchmark KDD Cup 2003 
dataset with approximately 30,000 high energy physics papers. We also eval¬ 
uate the performance of our model using a simulation study. A particularity 
of citation networks is that articles join the network over time, and published 
articles cannot cite articles appearing at a later date. Hence the absence of 
such links cannot be construed as true “zeros” and should be omitted from 
the likelihood. We show that taking into account publication times (when 
available) can significantly improve performance of our model. 

The rest of the paper is organized as follows. Section 2 lays out the details 
of our model and reviews closely associated models. Section 3 introduces a 
variational algorithm for obtaining approximate posterior inference and Sec¬ 
tion 4 describes how the algorithm can be scaled up to large networks using 
stochastic optimization methods. Section 5 discusses comparisons with al¬ 
ternative approaches and Section 6 predictions and article recommendations 
based on our proposed model. Section 7 presents application results using 
simulations and real data. We conclude with discussion in Section 8. 

2. Model description. Our proposed model combines LDA with the 
MMSB, and introduces a latent variable for each article which acts as a 
measure of its visibility adjusted for topic-level variation in activity level. 
Before describing our model, we review LDA, MMSB and other associated 
models. 

2.1. Review of LDA, MMSB and Pairwise-Link-LDA. LDA is a gener¬ 
ative probabilistic model for text copora which can be used for tasks such 
as detecting themes, summarization and classification. It assumes that word 
order can be disregarded (“bag-of-words” model) and that each document 
in the corpus exhibits K topics with varying proportions. Let the number 
of documents in the corpus be D and the size of the vocabulary be V. Each 
topic /3k is a V x 1 vector with a Dirichlet(ry) prior, representing a probability 
distribution over the vocabulary. For each document d, the topic proportions 
6d is a K x 1 vector with a Dirichlet(a) prior, representing the probability of 
each topic occurring in the document. Let the number of words in document 
d be Nd- The nth word in document d , Wd n , is generated by drawing a topic 
assignment Zd n from Multinomial^) and the word from Multinomial (/ 3 Zdn ). 
Both Zdn and Wdn are indicator vectors with a single one. If the kth element 
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LDA (for modeling text) 


MMSB (for modeling links) 


1. For each node d = 1 ,,D, draw mixed 
membership vector 9d ~ Dirichlet(a). 


1. Draw topic /3k ~ Dirichlet(^) for 
k = 1,..., K. 

2. For each document d = 1,..., D, 

(a) draw topic proportion 6b~Dirichlet(a). 

(b) For each position n = 1,..., Nd, draw 

• topic assignment 

Zdn ~ Multinomial)^). 

• word Wdn ~ Multinomial (/3 Zdrl ). 


2. For each node pair (d,d ), draw 

• membership indicator for sender 
Sdd' ~ Multinomial)#^). 

• membership indicator for receiver 
rdd' ~ Multinomial)#^/). 

• interaction y d d' ~ Bernoulli(H StJd , T . tld/ ). 


Fig 1. Generating process for LDA (left) and MMSB (right). These two models can be 
combined by identifying the communities in MMSB with the topics in LDA, that is, the 
topic proportions with the mixed membership vectors (both denoted by 6d). 


of Zdn is one, Wdn is drawn from the topic /3 Zdn , which refers to /3k (slight 
abuse of notation). 

On the other hand, MMSB is a mixed membership model for relational 
data that can detect communities within a network. Suppose the relational 
data is represented by a directed graph. For each node pair (d, d') where 
d ^ d!, we define the binary variable ydd' to be 1 if there is a directed edge 
from d to d' and 0 otherwise. The MMSB assumes that there are K latent 
communities (groups) and each node belongs to the K groups with varying 
degrees of affiliation. Specifically, each node d is associated with a K x 1 
membership vector, dd, drawn from a Dirichlet(a) prior, representing the 
probability of the node belonging to each of the K groups. Each node may 
assume different membership when interacting with different nodes. The 
blockmodel B is a K x K matrix where Bij represents the probability of a 
directed link from a node in group i to a node in group j. For each {d,d'), 
membership indicator vectors for the sender ( Sdd ') and receiver {rdd') are 
first drawn from Multinomial^) and Multinomial^') respectively. If the 
ith and jth elements of Sdd' and rdd' are ones respectively, the value of the 
interaction ydd> is sampled from Bernoulli(i? S(J(J , rdd ,) where B Sdd , rdd , refers to 
Bij. The generating process of LDA and MMSB are shown in Figure 1. 

Citation networks are a type of relational data where the nodes are the 
documents/articles and the directed links are the citations between them 
(there is a directed link from d to d' if d cites d'). Pairwise-Link-LDA (Nal- 
lapati et al, 2008) combines MMSB with LDA to jointly model text in ar¬ 
ticles and the citations between them by identifying the topics in LDA with 
the communities in MMSB. As links between documents indicate a certain 
level of similarity in topics, it is believed that network information, when 
suitably incorporated, would improve topic modeling (Kleinberg, 1999; Ho, 
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Generative process of LMV 

1. Draw topic /3k ~ Dirichlet(r/) for k = 1,..., K. 

2. For each document d — 1,..., D, 

• draw visibility T d ~ Beta(<?o, ho). 

• draw topic proportion 9d ~ Dirichlet(a). 

• For each position n = 1,..., Nd, draw 

— topic assignment Zdn ~ Multinomial^). 

— word wdn ~ Multinomial (/3 Zdrl ). 

3. For i, j £ {1,... ,K}, draw B t j ~ Beta(ao, 6 o)- 

4. For each document pair (d, d') where d 7 ^ d', 

• draw topic indicator for 
sender: Sad' ~ Multinomial (# d ). 
receiver: r dd / ~ Multinomial(0 d /). 

• draw y dd , ~ Bernoulli(r d /B Sdd(r<1<i ,). 



Fig 2. Left: Outline of LMV. Right: Graphical representation of a two-document segment. 
The complete model contains ydd 1 for every document pair. Circles denote variables and 
observed variables are shaded. The plates contain variables to be replicated and the number 
of times is indicated in the lower left corner. 


Eisenstein and Xing, 2012). 

2.2. Proposed model: LMV. In Pairwise-Link-LDA, any two documents 
with the same topic proportions have equal probability of being cited. This 
assumption is easily violated in real-world citation networks as factors other 
than research topics affect the citation probability. For instance, a well-cited 
document’s higher chances of being cited may be due to its quality, nov¬ 
elty and the authors’ social networks. Our proposed model aims to capture 
collectively via the visibility measure, attributes of the cited document that 
explains this variation in citation probability given compatibility in topics. 

Our proposed model is presented in Figure 2. The text is generated as in 
LDA. For the generation of links, we introduce a latent variable t# for each 
document d ', drawn from a Bet&(go,ho) prior, which modifies the citation 
probability by scaling the blockmodel. Given that the ith element of sygi 
and jth element of are ones, the probability of a document d from the 
All topic citing a document d! from the jth topic becomes Tg/Bij, which is 
dependent on the characteristic of the document d! receiving the citation. 

We define t# as the topic-adjusted visibility of document d', which is a 
scaling factor that adjusts the universal probability of being cited based 
on citation probabilities within/between topics. It is a characteristic of d' 
that accounts for the variation in citation probability among documents 
with equal topic proportions. This variation might be due to attributes of d! 
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which are unobserved, hard to quantify, or not directly modeled. The topic- 
adjusted visibility is potentially useful as a descriptive article level measure 
of citation likelihood that adjusts for the differences in citation practices in 
different topic areas. We use the acronym LMV for our model, taking the 
first letter from LDA, MMSB and visibility. 

Our model reduces to Pairwise Link-LDA (Nallapati et al., 2008) when 
Td' is identically one. One computational issue associated with LDA and 
MMSB is the multi-modality of these models, whereby multiple model con¬ 
figurations give equivalent fits to the data (see e.g. Ho, Parikh and Xing, 
2012). Combining content and connectivity information in a citation network 
may alleviate the multi-modality issue of MMSB. On the other hand, the 
communities detected in the citation network regularize the topic modeling 
on the content information. 

We assume that the hyperparameters ao, bo, go, ho, r] and a are known. 
The latent variables Sdd’ and rdd ' give rise to a more tractable joint distribu¬ 
tion, which leads to significant simplification of the variational optimization 
procedure to be introduced in Section 3. 

2.3. Review of associated methods. The relational topic model (RTM, 
Chang and Blei, 2010) also uses LDA as a basis for modeling document 
networks. RTM does not consider every pairwise interaction, however, and 
models only observed links. It uses a symmetric probability function with 
a diagonal weight matrix, which allows only within topic interactions. To 
improve the RTM, extensions have been proposed. For instance, Chen et al. 
(2013) define generalized RTM with a full weight matrix and perform regu¬ 
larized Bayesian inference where a log-logistic loss is minimized. A regular¬ 
ization parameter is used to control influence from link structures. Zhang, 
Zhu and Zhang (2013) propose sparse RTM where normal and Laplace pri¬ 
ors are placed respectively on the topic and word representations, and word 
counts are Poisson distributed. Different regularization parameters are used 
for links and non-links in the minimization of a log-loss. 

Some other extensions of LDA to document networks include the Top- 
icBlock (Ho, Eisenstein and Xing, 2012), which uses text and links to in¬ 
duce a hierarchical taxonomy, and block LDA (Balasubramanyan and Co¬ 
hen, 2013), which considers documents annotated with entities and models 
only realized links between entities using a stochastic blockmodel. Zhu et 
al. (2013) propose a Poisson mixed-topic link model that combines LDA 
with a variant of the MMSB, called the Ball-Karrer-Newman model, where 
the number of links between two documents is Poisson distributed instead 
of Bernoulli. Neiswanger et al. (2014) present a latent random offset model 
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that augments the topic proportions of the cited document with a vector to 
capture contents of citing documents in link predictions. These models do 
not address the issue that documents with similar topics may have different 
connectivity due to unobserved factors of impact. 

To improve the understanding of large text corpora, models that incor¬ 
porate additional information about the corpus or document metadata into 
text analysis have also been introduced recently. These models investigate 
the relationship between text data and observed variables that may affect 
the text composition (e.g. authors, date, political affiliation and quality rat¬ 
ings). To overcome the difficulty of incorporating high-dimensional text data 
into statistical analyses for predicting sentiment variables, multinomial in¬ 
verse regression (Taddy, 2013) uses the inverse conditional distribution of 
text given sentiment to obtain low-dimensional document representations 
that preserve sentiment information. The inverse regression topic model 
(Rabinovich and Blei, 2014) extends multinomial inverse regression to the 
mixed-membership (multiple topics) setting, while distributed multinomial 
regression (Taddy, 2015) tackles high-dimensional sentiment variables by 
modeling document-word counts as independent Poisson distributed vari¬ 
ables. Roberts et al. (2013) propose the structural topic model, which uses 
generalized linear models as priors to incorporate document-level covariates. 
In this paper, we adopt a different approach to the exploration of large text 
copora by modeling text and links between documents jointly. Further incor¬ 
porating document metadata into our proposed model will be an interesting 
direction for future research. 

3. Posterior Inference. As the true posterior of our model is not avail¬ 
able in closed form, we develop an efficient variational algorithm for posterior 
approximation. Let 0 denote the set of unknown variables in the LMV. In 
variational methods, the true posterior is approximated by more tractable 
distributions which are optimized to be close to the true posterior in terms 
of Kullback-Leibler divergence. Here we consider a fully-factorized family, 

<?(@)=n{®(wn qM{zdn\4>dn ) JJ [<?M (sdd> |)qM (rdd 1 \"dd 1 )] } 

d n d'^d 

x II 9u(/3fc|Afc) J^[ qs(Bij\aij, bij ) ]^[ qB^lgd^hd 1 ), 

k i,j d' 

where qp, qm , qB denote the Dirichlet, multinomial and beta distributions 
respectively and {A, 7 , </>, n, u, a, b, g, h} are variational parameters to be op¬ 
timized. Discussion on assumptions made in the variational approximation 
and their implications can be found in the Supplement. 
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From Jensen’s inequality, minimizing the Kullback-Leibler divergence be¬ 
tween q(Q ) and the true posterior is equivalent to maximizing a lower bound 
L on the log marginal likelihood, where L is given by 

(3.1) £ E ? [logp( 2 / dd '|r d /, B, s dd ',r dd >) + \ogp{s dd '\6 d ) + log p(r ddl \9 d f)] 

(d,d') 

+ ^2E q [logp(z dn \9 d ) + log p(w dn \z dn , /?)] + y^ y Eq[\ogp(Bij\a 0l b 0 )] 

d,n i,j 

+ ^2 E q[ l °gP( T d\go, ho) + \ogp(6 d \a)\ +^2E q [logp(/3 k \ij)] +H(q). 

d k 

In (3.1), Eg denotes expectation with respect to g(0) and H(q) denotes the 
entropy of q. All terms in £ can be evaluated analytically except E g {log(l — 
r d iBij)}. We expand this expectation using a first-order approximation about 
the mean (Braun and McAuliffe, 2010) so that 

Eg{log(l - T d iBij )} « log(l - Eg(r d /)E q (B ii )) 

9d' O'ij \ 

Hd! T h d ’ CLij T bij ) 

The approximate lower bound obtained using (3.2) is denoted by C*. Dis¬ 
cussion on the first-order approximation and the expression for C* can be 
found in the Supplement. 

We optimize C* with respect to the variational parameters via coordinate 
ascent (see Algorithm 1). For {A, 7 , 0 , k, v}, closed form updates can be 
derived by differentiating C* with respect to each parameter and setting the 
gradient to zero. For {a, b, g, h}, the likelihood is nonconjugate with respect 
to the prior and we use nonconjugate variational message passing (Knowles 
and Minka, 2011). This is a fixed point iteration method for optimizing 
the natural parameters of variational posteriors in exponential families. The 
advantages of this approach is that it yields closed form updates and extends 
to stochastic variational inference naturally. However, C* is not guaranteed 
to increase at each step and updates for {a, 6 , g, h} may be negative at times. 
To resolve these issues, we use the fact that nonconjugate variational message 
passing is a natural gradient ascent method with step size 1 and smaller step 
sizes may also be taken. In Algorithm 1, we start with step size 1 and reduce 
the step size where necessary to ensure updates of {a, 6 , g , h} are positive. If 
C* increases, these updates are accepted. Otherwise, we revert to the former 
values. Updates for {a, b,g, h} are derived in the Supplement. As updates of 
{a, b} and {g, h} are highly interdependent, we introduce a nested loop for 
cycling these updates in step 5 of Algorithm 1. 


13-2J 


= log 1 
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Algorithm 1 Coordinate ascent procedure for the LMV 

Initialize A, 7 , 0, k, v, a, b , g and h. Cycle the following updates until convergence 
is reached. 

1. For each document pair (d,d r ), cycle the following updates until n dd i an d Vdd' 
converge. 


Kdd'i oc exp |v>(7di) - 0 fE- 7 dij + Vdd'jSdd'ihj)} for i = 1,..., K. 
v dd 'j oc exp |?/>(7d'j) - 0 (E. 7 d'j) + E* Kdd'iSdd'tt, j)} for j = 1,..., A, 


where g d d'(i,j) 


ip(a i:j ) - ipfaj + bij) + ip(g d >) - 0( g d > + h d >) 



1 - 


9d' _ a ij 

9d'+h d / dij+bij 


if Vdd' = 1, 
if 2 /dd' = 0 . 


2. For d = 1,..., D, n = 1, ..., IV^, k = 1, ..., K, 


<t>dnk oc exp { 0(7 dk) - 0 (E fe 7dfc) + E„ Wdnv 0(A fclI ) - 0 (E l A fc „^ |. 


3. 

4. 

5. 


For d = 1,..., £>, 7 d = a + J2 n fan + ^2 d 'jtd( K dd' + v d’d)- 
Foi k 1, . . . , K , 1? 1, . . . , V, X kv tjv T ^^ d ^dnvPdnk' 

Cycle updates in (a) and (b) until convergence is reached. 


(a) For i = 1,..., K, j = 1,..., K, 


(1 - s t ) 


■ s t 



where 


CLij 


°0 + Yl(d,d'): y ddl = 1 K dd'iVdd'j 

Pij. 


bo 


(o*j + bij)ip'(aij + ) - hj'tp'(bij) 

aijip'(a,ij) - (dij + bij)ip'(a,ij + 6 ^) 



Kdd'iVdd'] g J+ hd , 

/ -J ^ _ ® d> _ a ij 

(d,d')-. y dd /= 0 9 d '+h d r dij+bij 


Start with s t = 1. If any a,; ? < 0 or b^ < 0, reduce s t (say by half each time) 


until all dij > 0 and bij > 0. Accept update only if C* 


increases. 


(b) For d! = 1, 


■ T). 


9d' 

h d > 


(1 _ s t) 


9d> 

h d ' 


+ St 


9d' 

h d ' 


where 


9d> 


90 + X/d Vdd' 

hd'. 


ho 


(. 9d' + h d ')ip’(g dl 
gd'4>'(gd') - (gd' 


\Ig d i> h d’ I ^9d’ + hd') 2 

Sd: y dd i= 0 K dd'i^dd'j ai j+bjj 


hd') - h d 'ip'(h d ') 
hd')i>'(g d ' + h d >) 


E 

hi 


1 - 


9 d > 


9 d '+h d i dij+bij 


Start with St = 1. If any g d > < 0 or h d ' < 0, reduce St (say by half each time) 
until all g d ' > 0 and h d ' > 0. Accept update only if C* increases. 

Note: \I a ,b\ denotes determinant of the Fisher information matrix of Beta(a, b). 

See Supplement. 
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4. Stochastic optimization of variational objective. We develop 
an online variant of Algorithm 1 that scales well to large networks using 
stochastic variational inference (Hoffman et al., 2013). In this approach, 
variational parameters are classified as local (specific to each node) or global 
(common across all nodes) parameters. At each iteration, a minibatch of 
nodes are randomly sampled from the whole dataset and local parameters 
corresponding to these nodes are optimized. Global parameters are then up¬ 
dated based on optimized local parameters using stochastic gradient ascent 
(Robbins and Monro, 1951). The algorithm converges to a local maximum of 
the variational objective provided the step sizes and the objective function 
satisfy certain regularity conditions (see Spall, 2003). 

Currently, Algorithm 1 has to update the variational parameters Kdd' and 
Udd' for each document pair {d, d!) at every iteration. This computational cost 
scales as 0(D 2 ) and makes our model infeasible for large networks. To apply 
stochastic variational inference, we regard Kdd 1 and Vdd 1 as local parameters 
and perform these updates only for a random subset of all document pairs 
at each iteration. Remaining variational parameters are regarded as global 
parameters and are updated using stochastic gradient ascent. 

4.1. Proposed sampling strategy. We devise a novel scheme for sampling 
document pairs. While simple random sampling is a possibility, it does not 
utilize information provided by the links. Raftery et al. (2012) propose a 
stratified sampling scheme where stratums are defined by shortest path 
lengths. Assuming “closer” nodes contain more information and that large 
networks are often sparse, they sample all links for each node and only a 
small proportion of nonlinks from each stratum. Gopalan and Blei (2013) 
consider “informative set sampling”, where “informative set” for a node 
consists of all links and nonlinks of path length 2. Remaining nonlinks are 
partitioned into “noninformative sets”. At each iteration, either an “infor¬ 
mative” set is chosen with high probability or one of the “noninformative 
sets” is chosen with low probability. These schemes are not directly appli¬ 
cable to our model. To update 7 d (Algorithm 1, step 2), unbiased estimates 
of Yhd’^d K dd' and J2d’^d 1J d'd are required. That is, samples has to be drawn 
from cases where d is the citing document as well as cases where d is the 
cited document. As Gopalan and Blei (2013) treat links as undirected while 
Raftery et al. (2012) do not subsample documents, they do not face these 
restrictions. 

We propose the following sampling scheme. Consider the adjacency matrix 
y of ones and zeros, where the rows and columns denote the citing and cited 
documents respectively (diagonal is undefined). We associate each document 
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pair (d,d') with an inclusion probability TTdd'i where ndd 1 is a decreasing 
function of the shortest path length from d to d'. While other definitions are 
plausible, we define for simplicity, 


(4.1) 


T^dd! 


1 Add' Ud’ > no, 
1/no otherwise, 


where Idd> denotes the shortest path length from d to d! and no is a positive 
integer. When y ( id‘ — 1, vr dd’ = 1- Hence all links are included and more 
“informative” nonlinks (in the sense of shorter path length) have a higher 
probability of being included. In the examples, we set no = 100. This implies 
that document pairs with a shortest path length of 100 or more have a 
probability of 0.01 of being included. It is possible to experiment with other 
values depending on the application and computational constrains. A smaller 
no implies a larger sample of document pairs. At each iteration, we 

1. select a random sample S of |<S| documents from the whole dataset. 

2. perform a Bernoulli trial with success probability 7r^' for each (d, d') 
where d £ S or d' £ S. 

3. select document pairs with successful trials (denote this set as V). 

The Bernoulli trial is performed only once for each document pair even if 
both d and d! are in S .The sampling strategy is illustrated in the Supplement. 


4.2. Stochastic variational algorithm. In the stochastic variational algo¬ 
rithm (Algorithm 2), updates for Kdd 1 and Vdd' are cycled until convergence 
for each (d, d') £ V so that local parameters are optimized at the current 
global parameters. The global parameters are then updated using stochastic 
gradient ascent. For a parameter Aj, we consider an update: 

(4.2) Ai<- \i + s t V Ai r, 

where St denotes a small step taken in the direction of V^X* (natural 
gradient of C* with respect to A {). In variational Bayes and nonconju¬ 
gate variational message passing, the natural gradient (Amari, 1998) is 
Va,X* = A* — Aj, where A* is the optimal update of A*. See Supplement 
for details. Hence the update in (4.2) can be written as 

Aj £- (1 — St)\i + StXi- 

In stochastic gradient ascent, we replace the true natural gradients with 
unbiased estimates. For convergence, the step sizes should satisfy the con¬ 
ditions St —> 0, s t = 00 an d s t < °°- 
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Algorithm 2 Stochastic variational procedure for the LMV 

Initialize 7 , A, k. cf>, 17 a , b , < 7 , h. At each iteration, 

1. Obtain a random sample S of |<S| documents from the corpus. 

2. For each document pair (d, d'), where d G S or d' G S, perform a Bernoulli 
trial with success probability 7 idd' ■ Let V denote the set of document pairs with 
successful trials. Let Vd■ = {(/, /') G V\l = d} and V. d = {( l, l') G V\V = d}. 

3. Update ddd 1 and Vdd' iteratively as in Algorithm 1 for each (d, d') € V until 
convergence. 

4. Update 4>dnk for d € S, n = 1,..., N d and k = 1,..., K as in Algorithm 1. 

5. For d € S, 7 d (1 - s t )+ s t % where 

% = a + J 2 ^n+ E —+ E —■ 

Z L ' TTll' L ' TTii' 

n (i,nev d . (U')ev. d 


6 . For k = 1,..., K and v = 1,..., V, Afc„ t— (1 — St)\kv + StXkv > where 

A kv ~ dv 4” 7“wr EE dldnv 0d/nk ■ 

I I deS n 


7. For i = 1,..., K and j = 1,..., K, 


13 1 <- (1 - s t ) r* 3 

°ij J 


St 




where 


a Jj 


a 0 + |S| Sd'eS 22(ld')eV. d r.y u ,=l K ll'i v ll'j 

Pij. 


bo 


1 


(ciij + bij)ip'(a,ij + bij) - biji>'{bij) _ 

aijip'(aij) - (a i:j + b i:j )ip'(dij + bij )J |S| £ 


I laijMj I ( a ij + bij) 2 

K ll'i v ll'j 9d' 


D ST' ^(U')e'P. d ':gll'=0 7T„/ 9d’+ h d’ 

Ts\ 2^ ~ 


1 - 


gd' a O 

Sd'+^d' Oij+U 


If any a.y < 0 or bij < 0, reduce St for this update (say by half each time). 


3. For d' G 5, 


9d' 

hd’ 


(1 - s t ) 


9d ' 
hd’ 


+ S( 


9d’ 

hd’ 


, where 


y 


go + X/d J/dd' 

pd'_ 


ho 


14 


Pd' »' l d / 


(g# + h d ’W (gd> 

gd'ip'(gd') - (sd' 


/id') - h d 'ip'{h d ' 

hd'Wigd' + /idc 


|(5d' + h d ') 2 

T,(i,i’)er. d ,-.v u ,=0 


E 


n ll' 


j ~\~bi 


1 - 


9d' _ a ij 

9d'+h d / aij+bi . 


If any < 7 ^ < 0 or h d ’ < 0, reduce s t for this update (say by half each time). 




TOPIC-ADJUSTED VISIBILITY METRIC FOR SCIENTIFIC ARTICLES 15 


Using our proposed sampling scheme, unbiased estimates of the true natu¬ 
ral gradients can be computed via the Horvitz-Thompson estimator (see e.g. 
Kolaczyk, 2010). Let V d . = {(/,/') € V\l = d} and V. d = {(l,V) € V\V = d} 
denote samples in V lying in the dth row and column of the adjacency matrix 
respectively. We have, for example, the following unbiased estimates, 


(4.3) 


E 

(MO £P d . 


Kg' 

KW 


Y Kdd' and 
d'^d 


D 

LSI 


E 


K ll>i v ll'j 


mer. d , 

d'es 




^ ^ K'dd'i^dd' j • 
(d,d>) 


The estimator on the right arises from a two-stage cluster sampling scheme 
which involves sampling the columns ( d' E S ) in the first stage and then 
sampling document pairs from these columns in the second stage. We show 
that this estimator is unbiased in the Supplement. 

The stochastic variational algorithm for our model is outlined in Algo¬ 
rithm 2. Implementation details are given in the Supplement. 


5. Comparison with alternative approaches. We compare our model 
with the most representative peer methods, RTM and Pairwise-Link-LDA. 
As a baseline for comparing methods which integrate the modeling of text 
and links, we also consider “LDA + Regression”, which involves fitting an 
LDA model to the documents followed by a logistic regression model to the 
links. The covariates corresponding to the observation y dd i are 7d©7 d >, where 
© denotes the Hadamard product and the kth. element of 7^ is 'Idk/ Yhildl- 
This approach models text and links separately and information from topic 
modeling is not utilized in the link structure. The RTM accounts for both 
text and links structure. However, it assumes a symmetric probability func¬ 
tion and considers a diagonal weight matrix that allows only within topic 
interactions. In addition, RTM only models observed links and does not 
deal explicitly with the imbalance between links and nonlinks. We consider 
the RTM with an exponential link probability function. Pairwise-Link-LDA 
combines LDA and MMSB. Comparing Pairwise-Link-LDA with LMV illus¬ 
trates the importance of the visibility measure in link prediction. 

All models are estimated using variational methods and the code for these 
algorithms are reproduced in R by ourselves. For LDA + Regression, RTM 
and Pairwise-Link-LDA, we have tried to follow the implementations sug¬ 
gested by the original authors as closely as possible. Variational parameters 
in LMV, RTM and Pairwise-Link-LDA are initialized using the fitted LDA 
and the same priors are used across all models. Details on priors and stop¬ 
ping criteria are given in the Supplement. 
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6. Prediction and article recommendations. We discuss how our 
model can be used to predict links for new documents assuming knowledge 
only of the text. An important application of the predictive probabilities 
is in recommending scientific articles, for instance, to researchers searching 
for information on certain research topics or who are preparing manuscripts 
and looking for relevant articles to cite. Using a short paragraph of text or 
even just keywords (as text of the “new document”), predictive probabilities 
of links to documents in the training set can be computed and used as a 
means to rank documents and construct recommendation lists. As our model 
captures both inter- and intra-topic citation probabilities and estimates the 
visibilities of individual articles (which are adjusted for held variation in 
citation practices), it can identify relevant articles which are multi-topic or 
of high visibility but coming from topics with low citation rates. Wang and 
Blei (2011) and Gopalan, Charlin and Blei (2014) consider recommendation 
of scientific articles to readers of online archives based on article content 
and reader preferences. They do not consider citations amongst articles and 
make use of collaborative filtering. 

The predictive probability of a link to a document in the training set can 
be computed as follows. First, we fit the LMV model to documents in the 
training set. Then we perform variational inference on the new document d 
to obtain its topic proportions (see e.g. Nallapati et al. , 2008). That is, we 
iterate till convergence the updates: 

1 • 7ci ^ T 4*dm 

2. cj) dnk oc exp {VKt dk) ~ V’ (J2k 7 dk) + w dnv [^(A kv) - CC„ Afc„)]} 

for n = 1,..., N d , k = 1,..., K, 

where A is obtained from the fitted model. The first update is similar to 
step 3 of Algorithm 1. In this case, we do not assume knowledge of the 
links of the new document. Hence parameters associated with the links are 
absent. Let w d and wt denote the words of the new document d and the 
training set respectively and let y t denote links within the training set. 
Approximating the true posterior by the variational approximation q(Q), 
the posterior predictive probability that d will cite any document d! in the 
training set is 

P{Vdd' = l\w d ,W T ,yT) ~ Eq(Td')Eq(d d ) T Eq(B)E q (e d r) 

C 6 - 1 ) = 9d' f 7 d \ T a "id! 

9d' + h d > V Y,k=i Idk) a + h Ef=i 7 d'k 

6.1. Predictive rank. In the examples, we compute the average predictive 
rank of held-out documents, following Chang and Blei (2010), as a way to 
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evaluate the fit between considered models and the data. The predictive rank 
captures a model’s ability to predict documents that a test document will 
cite given only its words. To compute the predictive rank of a test document 
for model M, first fit model M to documents in the training set. Using the 
fitted topics, obtain the topic proportions of the test document using just its 
words as described above. Compute the posterior predictive probability that 
the test document will cite each document in the training set (use (6.1) for 
LMV) and rank the documents according to this probability. The predictive 
rank is the average rank of the documents which the held-out document 
actually did cite. Lower predictive rank indicates a better fit, and it also 
implies that the articles that were actually cited are placed closer to the top 
of a recommendation list for a test article. 

7. Applications. We apply our methods to two real datasets. To assist 
understanding of our proposed model and to evaluate Algorithms 1 and 
2, a simulation study is also provided in the Supplement. In this study, 
we generate datasets from the LMV and demonstrate that Algorithms 1 
and 2 are able to recover the structure of the blockmodel B as well as 
the visibility of each document. We also show that our model performs 
significantly better than LDA+Regression, RTM and Pairwise-Link-LDA in 
link predictions. Predictive results from Algorithms 1 and 2 are very close 
and our subsampling strategy helps to reduce computation times. 

In the following, the blockmodels, visibilities, topic proportions and topic 
assignments of documents are estimated using the posterior means of corre¬ 
sponding variational approximations. For each citation, a hard assignment 
of topics to the citing and cited documents is obtained by taking the posi¬ 
tions of the maximum elements of Kdd' and Vdd' respectively. For instance, 
if K dd' = [0.8,0.2,0,..., 0] and h> ( id' = [0-3,0.7,0 ..., 0], we interpret the ci¬ 
tation as being from topic 1 to 2. For visualization of fitted topics, we order 
terms in the vocabulary using the term-score (Blei and Lafferty, 2009), 

term-score^ = X kv log < --— 

l (rifc=i ^kv) k 

where X kv = X kv / Yli ^ki denotes the posterior mean probability of the utli 
term appearing in the Lth topic. The second part of the expression down¬ 
weights terms that have high probability of appearing in all topics. This 
term-score is inspired by the TFIDF term score used in information re¬ 
trieval (Baeza-Yates and Ribeiro-Neto, 1999). Further discussion on ways to 
examine the quality of uncovered topics can be found in the Supplement. 
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Predictive Rank 


CPU Runtime 



Fig 3. Cora: Average predictive ranks and CPU times in seconds for different methods. 


We denote LDA + Regression and Pairwise-Link-LDA by LDA+Reg and 
PLLDA respectively. Methods with subsampling have a “S” added at the 
end, e.g. LMVS denotes LMV with subsampling. If publication times are 
taken into account, we add a “(f)” at the end. 

7.1. Cora dataset. The Cora dataset from the R package Ida (Chang, 
2012) has 2410 documents, 4356 links and a vocabulary of 2961 terms. This 
dataset is relatively small and it allows us to compare the predictive per¬ 
formance and CPU times of Algorithms 1 and 2. We randomly divide the 
dataset into five folds; each fold is used in turn as a test set and the re¬ 
maining folds are used for training. During training, only documents in the 
training set and links within them are used. We investigate the predictive 
performance of different models for number of topics K ranging from 5 to 
13. For Algorithm 2, we set the minibatch size as 200. 

7.1.1. Predictive performance and computation times. The average pre¬ 
dictive ranks and CPU times of different approaches are shown in Figure 3. 
We consider the hyperparameter r/, which controls the concentration of the 
topic distribution, to be 0.1 or 0.5 (if r/ is small, the probability distribution 
on the vocabulary will be concentrated on a small number of terms). Ex¬ 
cept for the RTM, predictive performance of all other methods are better 
when r] is 0.5 as compared to 0.1. LMV achieved significantly better predic¬ 
tive performance than the other models and attained 60-76% improvement 
in predictive rank over baseline 2 . Predictive performance of Algorithm 2 is 
close to that of Algorithm 1 even with subsampling and computation times 
are reduced, particularly when p = 0.1. For large datasets, Algorithm 2 

“Predictive rank computed by random guessing is n f l where n is the number of doc¬ 
uments in the training set. 
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Visibility (LMV) 
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B y (LMV) 


Fig 4. Cora: Visibilities (left) and blockmodel elements (right) estimated by LMVS and 
averaged over 50 runs against corresponding qualities estimated by LMV. Points lying close 
to the dotted line (y = x) indicate good agreement between LMV and LMVS. 


presents an avenue for overcoming computational and memory constraints 
while maintaining the same level of predictive performance. Predictive per¬ 
formance for the LMV stabilizes at around 9-11 topics when r/ = 0.5. In the 
following, we concentrate on the fitted models corresponding to K = 9 and 
r] = 0.5 for one of the folds. 

7.1.2. Evaluating accuracy of Algorithm 2. Repeating the LMVS (Algo¬ 
rithm 2) runs 50 times, we compute the average visibilities and blockmodel 
over these 50 runs and plot these quantities against corresponding values 
estimated by LMV (Algorithm 1) in Figure 4. There is very good agreement 
between the visibilities and blockdiagonal elements estimated by Algorithms 
1 and 2. For the left plot, there is greater variation near zero while the biggest 
two values appear to be slightly overestimated by LMVS in the right plot. 

7.1.3. Visibilities of individual articles. Figure 5 plots the visibility of 
documents in the training set estimated by Algorithm 1 against their cita¬ 
tion counts. There is a general trend of visibility increasing with citation 
counts. Hence, the topic-adjusted visibility metric is capturing some de¬ 
gree of popularity. However, the relationship between visibility and citation 
counts is not monotone; a higher citation count does not necessarily imply 
a higher measure of visibility. The visibility metric Ty captures a complex 
mix of attributes of document d' that accounts for the variation in citation 
probability among documents with similar topic proportions. These include 
but are not limited to popularity. 

To illustrate how the incorporation of visibilities improves predictive per¬ 
formance, let us consider as an example, the test document “Some extensions 
of the k-means algorithm for image segmentation and pattern classification”. 
This article cited two documents from the training set: (1) A Theory of Net- 
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Fig 5. Cora: Visibility against number of citations for each document in training set. 


Table 1 

Cora: Citations, visibility and predictive ranks of two cited documents. 


Index 

Citations 

f 

LDA+Reg 

RTM 

PLLDA 

LMV 

LMVS 

1 

26 

0.96 

98 

108 

120 

10 

9 

2 

48 

0.98 

385 

1082 

336 

53 

61 


works for Approximation and Learning and (2) Self-Organization and Asso¬ 
ciative Memory. The citations, estimated visibility and predictive ranks of 
these documents are given in Table 1. The predictive ranks assigned by LMV 
and LMVS are significantly lower than the other methods. Interestingly, if 
visibility is not taken into account, the ranking by LMV will be similar to 
Pairwise-Link-LDA (120 and 357 for documents indexed 1 and 2 in order). 
However, factoring in the visibilities of these documents, which are much 
higher than the average of 0.42, improves predictive ranks tremendously. 
On the average, LMV improves predictive performance by more than 30% 
over RTM, PLLDA and LDA+Reg when r/ = 0.5. This provides an indica¬ 
tion of the favorable performance of LMV in article recommendation. 

7.1.4. Citation behavior among fitted topics. Figure 6 provides a visual¬ 
ization of the LMV fitted using Algorithm 1, where (a) shows the estimated 
blockmodel B and top words of each topic and (b) shows the citation activ¬ 
ity in the training set. Figure 6(b) is based on raw citation counts and tends 
to be dominated by topics with high citation frequency or large number of 
publications. On the other hand, Figure 6(a) shows the citation probabilities 
within/between topics. It provides a more structured and unbiased under¬ 
standing of the citation tendencies within/between topics. From Figure 6(a), 
topic 2 tends to be cited strongly by topics 3, 5-7 and 9. It also tends to 
cite nearly all other topics besides itself. Topic 9 tends to cite topics 1 and 
2, and topic 8 has the highest tendency to cite documents within its own 
topic. Information such as these are helpful in understanding how citation 
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(a) Blockmodel B and top words of each topic 



(b) Citation activity 


Fig 6. (a) shows the citation patterns between different topics and the topic-to-topic cita¬ 
tion tendencies. Each element Bij is represented by a belt from topic i to j and the width 
of the belt is proportional to Bij. The value on the inner arc of topic i represents the 
probability that topic i will cite any topic while the value on the outer arc represents the 
total probability that topic i will cite or be cited by any topic. The top 7 words of each topic 
are displayed in word clouds and the font size is proportional to the term-score, (b) shows 
the actual citation activity among different topics. The width of each belt is proportional 
to the number of citations between the topics connected by the belt. The value on the inner 
arc is the total number of citations originating from a topic while the value on the outer 
arc is the total number of citations coming from and going to that topic. The direction of 
the belts can be inf erred from the presence (at the origin) and absence (at the destination) 
of the inner arcs. 


behavior varies from one scientific area to another, and even among different 
research fields within a discipline. 

7.2. KDD high energy physics dataset. The high energy physics (HEP) 
dataset for the KDD Cup 2003 competition (see Gehrke, Ginsparg and Klein- 
berg, 2003) contains 29,555 papers added to arXiv from 1992-2003 and is 
available at http://www.cs.Cornell.edu/projects/kddcup/index.html. 
We consider the abstracts of 25,224 papers added between 1992-2001 and 
the 271,838 citations among them as the training set. We use 4064 papers 
which have cited at least one paper from the training set and were added 
between 2002-2003 as the test set. There are 60,914 citations from the test 
set to the training set. The vocabulary consist of 7211 terms after stemming 
and removal of stop words and infrequent terms. We have the dates when 
the papers were published online at the library of the Stanford Linear Accel- 
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Fig 7. KDD: Average predictive ranks (first row) and CPU times in hours (second row) 
of different approaches. The columns correspond to g £ {0.5,0.1,0.01} from left to right. 


eration Center (SLAC), as well as the year and month in which the papers 
were added on arXiv. As some papers were first published on SLAC and 
later added on arXiv or vice versa, we use the earlier of the two dates as 
the date when a paper is first available. Memory constraints render running 
the algorithms for Pairwise-Link-LDA and LMV in batch mode infeasible. 
Hence, we only use Algorithm 2 and Pairwise-Link-LDA is also implemented 
using our proposed subsampling strategy. Minibatch size of 2000 was used. 

7.2.1. Predictive performance and computation times. Figure 7 shows 
the average predictive ranks and CPU times 3 of different approaches for 
number of topics K ranging from 10 to 30 and p E {0.5,0.1,0.01}. LMV 
provides better predictive performance than Pairwise-Link-LDA and RTM 
for all r] and K, and taking into account publication times yields significantly 
better predictions. This is likely due to more accurate estimation of the 
visibilities as documents published at later dates are not penalized for not 
being cited by documents published before them. LMVS(t) achieved 73-83% 
improvement over baseline and optimal predictive performance at K = 20 
and t] = 0.5. In the following, we consider the model fitted by LMVS(t) 
when K = 20 and p = 0.5. 

li LDA was run in batch mode and CPU times are for model fitting only. We do not com¬ 
pute predictive ranks for LDA+Reg as logistic regression for this dataset is prohibitively 
expensive. Note that LDA may also be implemented with stochastic variational inference. 
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Fig 8. Blockmodel B Fig 9. Citation activity 

7.2.2. Interpreting citation trends in HEP. Figures 8 and 9 provide vi¬ 
sualizations of the estimated blockmodel and the citation activity in the 
training set respectively. These plots may be read as in Figure 6. The block- 
model in Figure 8 indicates high probability of within topic citation generally 
while across topic citation is much weaker in comparison. Figure 9 indicates 
that a large proportion of citations in the corpus occurred within topics 1-2. 

Next, we focus on topics 1-4, 6 and 8-9 where interconnectivity is higher 
among them. Figure 10 provides a visualization of the citation patterns be¬ 
tween these topics and reveals some interesting trends in the citation land¬ 
scape of HEP. First, there is a strong tendency for within topic citations 
while the probability of across topic citations is much weaker. However, the 
across topic citation relations that do exist are unsurprising as theoretical 
HEP deals with the fundamental aspects of Particle Physics and there are 
vast and deep links between these topics. It is also worth noting that certain 
topics do not have a high tendency to cite each other (e.g. topics 3 and 6) 
even though there are overlaps in the body of knowledge (e.g. supersym¬ 
metries and string theory). A possible reason is that HEP physicists may 
not always be aware of one another’s work (especially amongst the 3 main 
groups: experimental, phenomenological and theoretical). The articles from 
these groups have different focus and may constitute different topics. 

Topics 1 and 2 are both associated with string theory, which claims that 
the fundamental objects that make up all matter are strings (like rubber 
bands) instead of particles (e.g. Gubser, 2010). The emphasis of topic 1 
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Fig 10. KDD: Visualization of citation patterns for topics l-f, 6 and 8-9. Width of arrows 
are proportional to the citation strengths in the estimated blockmodel. Only elements greater 
than 0.001 are visualized. Dashed arrows represent elements less than 0.005. Top 8 words 
of each topic are shown. Font size within each topic is proportional to the term-score. 


is on branes, which are multi-dimensional membranes that generalize the 
concept of particle (zero-dimensional brane) and string (one dimensional 
brane) to higher dimensions. Topic 2 is associated with gauge theory and 
orbifold, which are both related to geometry. Duality, an important concept 
in string theory, relates branes (topic 1) to gauge theory and supersymmetry 
(topic 2). Thus, the citation relationship between topics 1 and 2 is expected. 
From Figure 10, these two topics also have a relatively higher probability 
of receiving citations from other topics. This could be socially driven by 
the authors’ inclination to cite papers from which their respective topics 
originated. There may also be a tendency for researchers to appeal to popular 
topics and cite earlier successful theories such as the gauge theory which is 
the fundamental edifice of high energy particle physics. 

The prefix “super” is found in topics 1, 2 and 6. It is associated with 
supersymmetric theories which attempt to unify bosons and fermions under 
one generalized scheme; by relating fractional spin to integral spins, and 
finally unifying force and matter particles. Links from topic 6 to topics 1 
and 2 are therefore reasonable and expected. Topics 1 and 3 are clearly re¬ 
lated, namely: brane, string and gravity. String theory holds the promise to 
unify gravity to the other fundamental forces in Physics at the expense of 
introducing more dimensions that needs to be compactified mathematically. 
In the case of topic 4, research concerning entropy may be cited in relation 
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Fig 11 . KDD: Visibility against number of citations for each document in training set. 


to the event horizon of the black hole and this constitutes within-topic ci¬ 
tations. Mini black holes may also be regarded as a collection of dbranes, 
resulting in citations from topic 4 to 1. For future investigation, it may be 
of interest to narrow the study to a certain HEP group (e.g. experimental, 
phenomenological or theoretical) and a period when a particular topic is in 
fashion. 

7.2.3. Visibilities of individual articles. As in Figure 5, Figure 11 plots 
the estimated visbility of documents in the training set against their citation 
counts. There is a general trend of visibility increasing with citations as 
before. However, the estimated visibility for each citation count now vary 
over a wide range. This range increases as the citation count increases from 0 
but starts to gradually decrease as the citation count exceeds 20. It is more 
evident here than in Figure 5 that visibility correlates with but does not 
increase monotonically with citations. Hence the visibility metric captures 
a complex mixture of attributes that includes but goes beyond popularity. 

7.2.4. Application: article recommendation. We use an example to illus¬ 
trate the advantages that LMV can provide in article recommendation. Us¬ 
ing the abstract of the article “Open Strings” from the test set as a query, we 
compute predictive probabilities of links from this article to each document 
in the training set using LMV, Pairwise-Link-LDA and RTM. In practice, 
keywords or a short paragraph of text may be used as search query if the 
abstract or manuscript is not available. Figure 12 lists the top fifteen recom¬ 
mended articles for each approach based on predictive probabilities (articles 
ranked closer to the top have higher probabilities of a link). Information such 
as topic proportions (in the form of barplots), title, citation counts, year of 
publication and visibility metric (for LMV) are also provided. This is a re¬ 
alistic representation of recommendation systems that can be constructed 
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0.96 

170 
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Aspects of Dualities 

1 

92 

Some remarks on N=2 extended supersymmetric Yang-Mills theories and ... 

0 

00 

Possible Implications of the Quantum Theory of Gravity 

0 

99 

A View From the Island 

1 

97 

Reminiscences of Collaborations with Joel Scherk 

0 

96 

What is quantum field theory and why have some physicists abandoned it? 

5 

94 

A Note on the Neutrino Theory of Light 

8 

93 

What is Quantum Field Theory, and What Did We Think It Is? 

12 

92 

The Quest for Understanding in Relativistic Quantum Physics 

0 

00 
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3 

99 
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0 

97 

Questions in quantum physics: a personal view 
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About Symmetries in Physics 

0 
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Fig 12. KDD: Top 15 articles recommended by LMV, Pairwise-Link-LDA and RTM. 
Asterisks indicate articles actually cited by test article. This figure is intended to be read 
in color (color version available in online publication). 
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based on our proposed model. 

In this example, five of the fifteen articles (marked by red asterisks) rec¬ 
ommended by LMV were actually cited while none recommended by RTM 
or Pairwise-Link-LDA were cited. Comparing the topic proportions of the 
test article with those of the recommended articles, we note that RTM tend 
to recommend articles with high proportions of topic 9. This is likely because 
RTM only allows within topic interaction and the test article exhibits high 
proportions of topic 9. On the other hand, Pairwise-Link-LDA and LMV are 
able to model across topic citation tendencies and the probability of topic 
9 citing topics other than itself is quite high (see Figures 8 and 10). Hence 
Pairwise-Link-LDA recommends articles mainly from topics 1 and 3 while 
articles recommended by LMV exhibit a higher degree of mixing with topics 
from 1-3 mainly and smaller proportions of some other topics. The articles 
recommended by LMV tend to have more citations than those recommended 
by Pairwise-Link-LDA and RTM. This is because the visibility metric cap¬ 
tures a certain degree of popularity as shown in Figure 11. However, articles 
are not ranked based on citations alone; both topic compatibility and article 
visibility play a part in the ranking. For LMV, there is also a good mix of ar¬ 
ticles published from 1993-2000. While our model does not explicitly model 
time from publication, the visibility metric accounts for this to a certain 
degree in that old articles which have accumulated many citations will high 
visibility, but so will recent articles which have managed to garner a propor¬ 
tionately smaller number of citations. This example highlights some of the 
advantages that LMV has to offer for article recommendation such as being 
able to identify relevant multi-topic articles and taking article visibility into 
account in rankings. 

7.2.5. Visibility as a topic-adjusted measure. We examine the behavior 
of the visibility measure by plotting the estimated visibilities by topic and 
by year of publication in Figure 13. We note that the visibilities do not vary 
radically from one topic to another and are in fact much more comparable 
across different topics than the average number of citations per article. The 
boxplots of visibilities by year does not display any systematic correlation 
with time of publication as well. Interestingly, even as the average number 
of citations per year is decreasing rapidly for articles published in the later 
years, the visibilities are not showing any decreasing trend. This could be 
due to the tendency to cite the newest and latest research. 

Figure 14 is a barplot showing the topic proportions of documents with 
40 citations. We have fixed the number of citations in order to study how 
visibility varies with topic proportions. As visibility increases, there is a 
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Fig 13. KDD: First row shows a barplot of the average number of citations in each topic 
(left) and boxplots of the visibilities of training documents classified by topic (right). Second 
row shows a barplot of the average number of citations in each year (left) and boxplots of 
the visibilities of training documents classified by year (right). 
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Fig 14. KDD barplot: each bar represent the topic proportions of a document with )0 ci¬ 
tations. The documents are arranged by visibility in increasing order. The legend indicates 
the color representing each topic. The topics are ordered by average number of citations 
received; topic 1 has the highest and topic 20 the lowest average number of citations. This 
figure is intended to be read in color (color version available in online publication). 


transition in the color of the bars from being dominantly blue to red. That 
is, for the same citation count (40 in this case), the estimated visibility of an 
article from topics with low citation rates (red) tends to be greater than an 
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article from topics with high citation rates (blue). This phenomenon demon¬ 
strates that the visibility metric adjusts for differences in citation frequency 
among different topics by assigning higher visibilities to articles from topics 
with lower citation rates for the same citation count. For example, consider 
an article from Mathematics and another from Molecular Biology having 
the same number of citations. As the citation rate in Mathematics is much 
lower than in Molecular Biology, it is inappropriate to compare the raw ci¬ 
tation counts of these articles directly. Citations in Mathematics ought to 
be assigned higher weights and this can be achieved through normalization 
procedures, for example by dividing citation counts by the average number 
of citations per article for a discipline (Radicchi, Fortunatoa and Castel¬ 
lano, 2008). However, the choice of a suitable reference standard is a very 
intricate issue. The topic-adjusted visibility metric that we propose offers 
an alternative to tackle the problem of field variation. 

8. Conclusion. Citation activities of scientific applications and other 
article-centric activities (shares on social web, for instance) are of interest 
for the evaluation of scientific merit and impact of published research. In 
particular, the citation network among published articles is a special case 
of a relational network. Communities detected in a citation network has 
been shown to correlate well with scientific areas and research topics. A 
unique feature of a citation network is that content information is avail¬ 
able on individual nodes, which is arguably influenced by the same mixed 
group structure underpinning the citation connectivity. Combining content 
and connectivity information in a citation network may alleviate the multi¬ 
modality issue of community detection in network analysis. On the other 
hand, communities detected in the citation network regularize the topic 
modeling on the content information. 

The probability of being cited for a particular article is determined not 
only by its membership in one or more topic domains and the topic-level 
citation rates (within and between domains), but also factors that are unique 
to this publication such as timing, visibility of the authors, and novelty and 
importance of the results. In this paper, we introduce a model for citation 
networks that infers the topic domain structure of the articles and their 
citation links, and estimates the citation activity rates at both the topic 
domain level and the article level. For each article, we introduce a latent 
variable that serves as a topic-adjusted visibility metric of this article. A 
higher value of this latent metric indicates that this article is more likely to 
be cited than other articles that are located close-by in the topic domain. As 
we have shown in our application to real datasets, this metric correlates with 
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raw citations counts but is not merely a measure of popularity, as it accounts 
for variation in activity levels among different topics. The proposed model 
leads to significant improvement in link predictions, which can be helpful in 
article recommendation. It provides a better visibility metric for comparing 
individual scientific publications across different fields. 

The inference of the proposed model is realized via a novel variational 
Bayes algorithm. For real-world large document networks, we propose a sub- 
sampling strategy that enables the use of stochastic variational inference, 
which is computationally efficient and achieves a similar level of predictive 
performance as the variational Bayes algorithm. 
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SUPPLEMENTARY MATERIAL 

Supplement: Topic-adjusted visibility metric for scientific arti¬ 
cles 

(doi: COMPLETED BY THE TYPESETTER; .pdf). We provide additional 

material to support the results in this paper. This include further discus¬ 
sions, detailed derivations, illustrations and a simulation study. 
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SI. Variational inference for the LMV. In the variational approx¬ 
imation for the LMV, we assume that (1) the variational posterior is of 
a factorized form, and (2) variational posteriors of Bij and t# are in the 
same family as their priors. The other distributions arise naturally as the 
optimal densities under the product density restriction. For and t#, 
the likelihood is nonconjugate with respect to the prior. Hence we have 
assumed the above parametric distributions. The factorization assumption 
may have a degrading effect on inference if the variables have a high degree 
of dependence in the posterior. However, if posterior dependence between 
variables is weak, then variational approximation can lead to very accurate 
approximate inference. It has been observed that variational approximation 
is often able to capture the means accurately but tends to underestimate 
the posterior variance (Wang and Titterington, 2005; Bishop, 2006). Vari¬ 
ational approximation can also lead to excellent predictive inference (e.g. 
large-scale discrete choice models, Braun and McAuliffe, 2010). Note that 
in variational inference, the fitted model is only guaranteed to be optimal 
locally (not necessarily globally). A common strategy to overcome this issue 
is to run the variational algorithm starting with random initialization from 
multiple starting points. 

Sl.l. First-order approximation about the mean. In the variational infer¬ 
ence for the LMV, we encounter the term E’ g {log(l — t^B^)} which cannot 
be evaluated analytically, and we approximate it by taking the expectation of 
a first-order Taylor series expansion about the mean. Let f(X) = log(l — X), 
X = TjiBij and p = E g (X), where E g denotes expectation with respect to 
the variational posterior. It is assumed in q that and Bij are independent, 
Q( T d') = Beta(<7d'! h#) and q(Bij) = Beta (aij,bij). For a Beta (a,/3) random 
variable Y, E{Y k ) = 0 ff^ k 1 _ l E(Y k ~ 1 ) < 1. Therefore, the Taylor series of 
X is well-defined as all the moments of X are bounded and decreases as k 

^Corresponding author: tzheng@stat.columbia.edu 
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increases. Taking expectation of the Taylor series of f(X) about /x, we have 


(Sl.l) 


f{X ) = m + f'itiiX - /x) + ^(x - /x) 2 + ..., 
=* E q {f(X)} = /( M ) + ^Var,(X) + .... 


The first-order approximation E q {f(X)} ~ /(/x) is exact when the sup¬ 
port of X converges to a point. From Jensen’s inequality, we also have 
E q {f(X)} < f(n) as / is concave. While the first-order approximation is 
crude, we reason below that it often works well in our context. 

The kth term in the second line of (Sl.l) is given by — E g {(X— p,) k }/k(l — 
/x) fc . In the limit as p, —> 0, this term approaches —E q (X k )/k and thus the 
error in the first-order approximation is bounded by 0(E q (X 2 )). Note that 
H —> 0 implies h# g# or bij S> ay, in which case, E q (X 2 ) —>• 0 as 


E q (X 2 ) = E,(r|)E,(S?.) 


9d' + 1 9d' dij Q-ij + 1 

9d' + hd' + 1 9d' + hd' aij + bij a,ij + bij + 1 


In the context of document networks, it is often the case that /x is close to 
0 as the sparsity of the network (small number of observed links) translates 
into a low probability of a document from topic i citing another document 
from topic j. Since this probability is approximately ay / (ay + fry) in the 
variational posterior, fry is usually much larger than ay. 


SI.2. Approximate lower bound. We present the complete expression for 
the approximate lower bound C* discussed in Section 3. Using (3.1) and 
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(3.2), C* can be evaluated as 
^ ' ylv T ''y ' W dnv (f) dnk X kv J ’Pi^kv) V’ ( ^ ^ kvj ^ ' log T ( ^ ^ Trffc 

k,v ^ d,n ' L \ ^ / J V fc 

E Pdnk log (pdnk + ^{log B(ajj,bij) - log E(a 0 , bp)} + y^logr(7rf fc ) 


d,n,k 


h3 


d,k 


^ ^ ^ ^ 4>dnk ^ ^ {^dd'k ^d'dk ) Tdfc j ^PiUdk) | ^ ^ Td/c 

d,/c ' n d'/d 


+ £ 


log r (E 0 ^) -E log r(afc)j + K \ log r (e*)-e logT(r] v ) 


+ EE Kdd'iVdd'j^Vdd 1 VP{a%j) - ^(aij + hj) + 'p(gd') ~ i>{gd' + h d ')\ 

(d,d') i,j 

+ (1 - v**) log i 1 - a) } + E log r ( A ^) - E log r (E a ^' 


+ Ei( ao “ a ij)bP( a ij) - '•Pfaij + bij)]+(b 0 - bij)[ip(bij) - ip(cLij + bij )]} 

+ E^- 9 o - gd')VP(gd') - V>( gd' + h d >)] + (h 0 - h d >)[^{h d ') - ip(g d ' + h d ')]} 


d 1 


+ E log B (9d',h d ’) —Dlog B(g 0 , h 0 ) - E ( n dd ' k log K dd ' k + v dd , k log v dd ' k ), 
d' ( d,d'),k 

where r(-), ip{-) and £>(•, •) denote the gamma, digamma and beta functions 
respectively. 


SI.3. Deriving updates. Next, we show how closed form updates for 
{a,b,g,h} discussed in Section 3 can be derived using nonconjugate vari¬ 
ational message passing. First, we review some general results. Suppose 
< 7 ( 0 ) = WlLigiiQi) and qi(@i) is a member of some exponential family 
such that 

Qi{&i) = exp{Afti(0i) - hi(Xi)}, 

where A* is the vector of natural parameters and tj(-) are the sufficient 
statistics. The ordinary gradient of the lower bound C = E q {logp(y, 0)} — 
Epilog q(0)} with respect to A j is 

V Al £ = V Al E g {logp(y,0)} - CovqJt^OijJAi, 

where Cov^ [t*(0i)] represents the covariance matrix of tj(0j). As qi(&i) is an 
exponential family member, Covqjfj(0j)] is equal to the Fisher information 
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matrix of q t . The natural gradient can be computed by pre-multiplying the 
ordinary gradient with the inverse of the Fisher information matrix (Amari, 
1998). Thus the natural gradient of C with respect to Aj is given by 

(51.2) V A X = Cov^iUiQ^V^Egilogpiy, 0)} - A, 

Setting V a ,£ = 0 leads to the fixed point update: 

(51.3) A i <5- Cov 9i [^(0j)] _1 V Ai ^g{logp(y,0)}. 

In Section 4.2, we write the natural gradient in (SI.2) as Aj — A i, where 
A i = Cov qi [ti(@i)]~ 1 \/\ i E q {logp(y,Q)} is simply the nonconjugate varia¬ 
tional message passing update in (SI.3). In the case of conjugate priors, A* 
is independent of A* and reduces to the update of A i in variational Bayes. 
See for example Tan and Nott (2014). 

The variational posteriors q(Bij) = Beta(ajj, bij) and q(rd>) = Beta (g#, h,i:) 
belong to the exponential family. Thus we can derive updates for them using 

(51.3) . The natural parameter of Beta(a, b) is [a — 1 b — 1] T . Some useful 
results which can be easily verified are stated below. 


Results 1. Let I ah = 


if'(a) — if'(a + b ) —'ll;'(a + b ) 

— if(d + b) if'(b) — if'(a + b) 


be the 


Fisher information matrix of Beta(a , b). 


(a) I 


-1 

a,b 


(b) I- 


-l 


1f'(d 

+ 

b) + 

[bo 


1 


0 


— (a 0 - 1 )if'(a + b) + (bo - l)[if'(b) - if'(a + b )] 
if'(a) - if'(a + b) 

—if'(a + b) 



ao - 1 


b 0 - 1 


Using (SI.3) and Results 1, the updates of aij and bij are given by 


Clij 1 

= I~ x 

-V aij Eg{logp(y,@)} 

bij — 1 

dij fiij 

y bi jE q {\ogp(y, 0)} 


j -i f (ao - 1 ){if'(aij) - if'(aij + bij)} - (b 0 - l)if'(aij + bij) 
aijfiij } [—(a 0 - 1 )if'(aij + bij ) + (bo - 1 ){if'(bij) - if(dij + bij)} 


+ 


if'(aij) - if'(aij + bij) 
-if'(aij + b^) 


^ ' ^dd/l^dd/j T 


O 0 - 1 

bo ~ 1 


+ 


(d,d'y. 

Vdd' = l 

S(<2,<2')eSi K dd'i^dd'j 

0 


-h 


V 


K dd'i V dd' j 9d' 

E (aij+b t j) 2 g d i+h d , 

1_ Ft _ ffl u 

(d,d'): 9d! +Kd a ijd-bij 

Vdd'=° 


+ 


( a ij + bij) 2 \I aiit bi. 
D 


(dtj T bij)if (dij T bij) b^if (bij) 
djjlf (dij) (dij T bij)lf (dij + bij) 


^ {^d:y dd ,=0 K dd’iVdd'j}gJ+ hd , 


d' = 1 


2 Vdd'-- 
1 - 


9d' _ a ij 

9d'+ h d' a ij+bij 
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where \I a fi\ = ip'(a)ip'(b) - ip' (a + b{ip'(a) + ip'(b )}. 
Similarly, 


' 9 d’ - 1 


f dE q {logp(y,e)}-\ 

— / -1 

d 9 d ' 

hd' ~ 1. 

9d' 1 h d' 

8E q {\ogp(y,e)} 

L dh d , J 


= I 


-l 

9d’’ h d' 


( 9o - 1 )bP'{9d') - V>'( 9d 1 + h d ')} - (ho - 1 W(9d' + h d >) 
-(go - 1 W(gd' + hd') + (h 0 - 1 ){ip'(h d t) - ip'(g d i + h d >)} 


+ 


ip'(9d') ~ ip'(gd' + h d ') 

-ip'(gd' + h d >) 


Udd 1 + 

Jrf: 2/dd' =1 


—h d > 

9d’ 


EE ^dd , i l, dd , j a ij 

d '-Vdd'= Q (gd'+ fe d') 2 a ij+ b ij 

J _ 9d' _ a ij 

i,j 9d'+ h d' aij+bij 


go -1 
ho — 1 


+ 


’^2id:y r , rl i= 1 9dd' 


■■Vdd' 

0 


+ 


1 


(gd> + h d ') 2 \Ig dlj h d ,\ 


(9d> + h d ')ip'(g d ' + hd!) - h d iip'(h d i) 
9d'ip'(9d') ~ (9d> + h d i)ip'(g d > + h d :) 


E 


{Y.d:y dd ,= 0 K dd'iVdd'j} a + bi . 


1 - 


9d> 


9d'+b-d' a ij+bij 


S2. Illustration of sampling strategy. We use a simple example to 
illustrate how the sampling strategy described in Section 4.1 can be imple¬ 
mented in the case where publication times are available. Suppose we have 
a dataset with D = 10 documents published in three time periods, say doc¬ 
uments 1-3, 4-7 and 8-10 are published respectively in the periods t = 1, 2 
and 3. Assuming that documents can only cite other documents published in 
the same or earlier time periods, the block lower triangular matrix in Figure 
SI represent all document pairs where links might be observed. Suppose a 
random sample S = {1, 5, 8,10} is drawn, Bernoulli trials are performed for 
document pairs in the blue squares and successful pairs are included in V. 


Cited (d') 

123456789 10 



Fig SI. Documents 1-10 published in three time periods, t = 1,2,3. Suppose S = 
{1,5,8,10}. Grey squares represent document pairs (d,d') for which links might be ob¬ 
served. Bernoulli trials are performed for document pairs in blue squares. Successful pairs 
are included in V. 
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S3. Horvitz-Thompson estimators. The estimators in (4.3) are con¬ 
structed through the use of weighted averaging. To show that 



E K ll'i v ll'' 

7 Til' 

' 1 mev. d , 

d'eS 


/ ^dd'i^dd 1 j j 

(d,d') 


imagine that document pairs are selected through the following two-stage 
cluster sampling procedure P: 

1. Randomly select a sample S of documents. The columns corresponding 
to these documents in the adjacency matrix are the clusters selected at 
the first stage. 

2. Perform a Bernoulli trial with success probability Tin' for each document 
pair ( l , l') in these columns (i.e. for l 1 e S ) and select those pairs that are 
successful. 

Now, introduce a binary random variable Zym for each document pair (l, l') 
such that Znit\ = 1 if (/, l') is selected through procedure P and 0 otherwise. 
Then 

P(Z m = l) = P{d' € S)P((l,l') € V. d ,\d' eS) = 

Therefore 


E 



E 


K ll'i v ll'j 


W)ep. d , 




d'&S 



Kll'iVll'j 

KW 


Zw 


D 

151 


E 

m 


Kll'jVll'j 

nil' 


E(Z W ) 


^ ( ^dd’i^dd' j • 
(d,d>) 


S4. Implementing the stochastic variational algorithm. At every 
iteration of the stochastic variational algorithm, we randomly sample a mini¬ 
batch S of documents and update document specific variational parameters 
only for d G S. Sampling with or without replacement may be used as both 
approaches produce unbiased estimates. To ensure that document specific 
variational parameters are updated uniformly, we sample randomly without 
replacement at each iteration. That is, starting with D documents, we ran¬ 
domly sample |<S| documents in the first iteration. At the second iteration, 
we sample another |<S| documents from the remaining D — |<S| documents. 
This continues for say M iterations until there are no documents left and 
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we consider this as one sweep through the corpus. All documents are then 
replaced and we start making a second sweep. This process is repeated until 
convergence. In this approach, all document specific variational parameters 
are updated once in each sweep. 

Recall that the updates of {a, b} and {< 7 , h} are highly interdependent and 
hence a nested loop was introduced in Algorithm 1 for these parameters. 
While it is not possible to use a nested loop in Algorithm 2 (conditions 
for convergence of stochastic gradient algorithms will not be satisfied), we 
find that using moderately smaller minibatch sizes alleviates this issue as 
the frequency of feedback between these updates is increased within each 
sweep. 

We use a step size of the form — m 1 . , u , , where M denotes the number 

^ (s®+ m+ A 2 ) ’ 

of iterations required to make a sweep through the corpus and 0 < m < M— 1 
denotes the number of minibatches that have been analyzed at the s w th 
sweep. This step size is of the form st = with w = A\M V and 

W = M(A 2 + 1) — 1. For the Cora and KDD datasets, we set A\ = 1, 
A 2 = 5, v = 0.501 for the stepsize. 

55. Priors and stopping criteria. The same priors are used across 
all models. All hyperparameters of beta distributions are set as 1. We have 
tried more informative priors for the blockmodel but observed that the al¬ 
gorithms are not very sensitive to the hyperparameter values. In the real 
examples, we set a = [1/K,... ,1/K\. A symmetric Dirichlet(r/) prior is 
used for each fik- In our applications, we observed that small 77 leads to 
better predictive performance for the RTM while a larger 77 works better 
for Pairwise-Link-LDA and the LMV. This is likely due to the difference 
in modeling assumptions; RTM considers a diagonal weight matrix while 
Pairwise-Link-LDA and LMV use a blockmodel. 

All algorithms running in batch mode are stopped when the relative in¬ 
crease in the lower bound is less than 1CP 5 . For stochastic algorithms, we 
consider the relative change in the diagonal of the blockmodel after each 
sweep and stop when ||diag(R^“) — r( s ™ _1 ))|| < e||diag(R^ s “ _1 ' ) )||, where 
|| ■ || denotes the Euclidean norm and e is some small tolerance. We set 
e = 0.05 for the Cora dataset and e = 0.1 for the KDD dataset. 

56. Examining quality of latent topics. The quality of uncovered 
latent topics may be validated by examining the coherence of top words in 
each topic, comparing topics or identifying documents that are most repre¬ 
sentative of each topic. Chuang, Manning and Heer (2012) present a visual¬ 
ization system called Termite that supports evaluation of topics by introduc¬ 
ing measures for ranking and sorting terms. Wallach et al. (2009) propose 
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Fig S2. Graphical representation of K = 6 topics. First row: simulated topics. Second 
row: topics found by LMV (Algorithm 1). 


and compare methods for evaluating topic models based on estimating the 
probability of held-out documents given a trained model, while Mimno and 
Blei (2011) present a new Bayesian method based on posterior predictive 
checking for measuring how well a topic model fits a corpus. 

S7. Simulation study. We present here results of the simulation study 
mentioned in Section 7. To evaluate the accuracy and computational per¬ 
formance of Algorithms 1 and 2, we generate 20 datasets each consisting 
of 3000 documents from the LMV. We consider K = 6 topics generated 
from the symmetric Dirichlet(O.l) distribution. These topics are represented 
using images in Figure S2 (first row). Each image contains V = 100 pixels 
in a 10 x 10 grid and each pixel corresponds to a term in the vocabulary. 
The intensity of a pixel represents its frequency (grid is for ease of reading). 
We note that the text may not have been simulated in a realistic manner. 
However, we wanted an example where the topics can be visualized and 
recovered easily so that we can test the abilities of Algorithms 1 and 2 in 
recovering the blockmodel and visibilities. When the simulated topics are 
not sufficiently distinct or if there are overlaps, the inferred topics may turn 
out to be some combination of the simulated topics. Multiple runs may also 
produce varying results. In such cases, it will then not be possible to recover 
the true blockmodel and visibilities. To avoid such scenarios, we have used 
an example with a small number of well differentiated topics so that we can 
evaluate the accuracy of the recovered blockmodel and visibilities and study 
the behaviour of different algorithms on a common basis. 

Topic assignments are generated from symmetric Dirichlet(0.05) and the 
visibility of each document is generated from Beta(l, 1). We consider the 
blockmodel B shown in Figure S3. The table on the left shows the values 
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Fig S3. True blockmodel B. 




Fig S4. Barplots of average predictive ranks (left) and CPU times (right) for 20 simulated 
datasets. 


in B while the image on the right is a visualization of B. The values on the 
diagonal are larger, indicating a higher probability of within-topic citations. 
There are also certain off-diagonal elements which are nonzero but smaller 
in value indicating a smaller probability of citations across topics. Each of 
the 20 datasets differ in terms of links and text but are simulated using 
the same topics, blockmodel and visibilities. The number of words in each 
document is 100. The first 2000 documents are used for training and the 
remaining 1000 documents are used as the test set. 

The simulated datasets are fitted using LDA+Regression, RTM, Pairwise- 
Link-LDA and Algorithms 1 and 2. For Algorithm 2, we let the mini-batch 
size be 200 and £ = 0.015. For the step size, we let A\ = 2, A 2 = 3 and 
v = 0.501. Hyperparameters used to generate the datasets were used as prior 
hyperparameters. All methods were able to recover topics that were close to 
the ones used to simulate the data. The topics found by LMV (Algorithm 1) 
are shown in the second row of Figure S2. The average predictive ranks and 
CPU times of different methods are shown in Figure S4. Average predictive 
ranks of LDA+Regression and RTM are similar while Pairwise-Link-LDA 
does slightly better. LMV yields the lowest average predictive ranks and 
results produced by Algorithms 1 and 2 are very close. In this example, 
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LMV LMVS 



0.0 0.4 0.8 0.0 0.4 0.8 

True latent quality True latent quality 


Fig S5. Plots of posterior mean visibility (averaged over 20 datasets) against true visibility 
for Algorithms 1 and 2. 


the true blockmodel B has some nonzero off-diagonal elements which can 
be captured by Pairwise-Link-LDA and the LMV, but LDA+Regression and 
the RTM cannot model citations across different topics. LMV also performed 
better than Pairwise-Link-LDA because the simulated data use visibilities to 
distinguish the probability of being cited among documents of similar topics 
and Pairwise-Link-LDA does not have this flexibility. In terms of CPU times, 
RTM is the fastest as it scales as 0(DK ) followed by LDA+Regression. 
LMV and Pairwise-Link-LDA took much longer to converge as they scale 
as 0(D 2 K 2 ). However, subsampling can help to reduce the runtimes while 
producing equally good predictions. In examples where the data size is large, 
subsampling also helps to overcome memory constraints. 

Next, we study the accuracy of Algorithms 1 and 2. Figure S5 plots the 
posterior mean visibility averaged over 20 datasets against the true visibility. 
The estimated posterior mean visibility is close to the true visibility for both 
algorithms 1 and 2. However, there is a slight underestimation of visibilities 
that are very close to 1 in both Algorithms and a slight overestimation more 
generally in Algorithm 2. Table SI shows the mean and standard deviation 
of the posterior mean estimates of elements in the blockmodel B computed 
using LMV (Algorithms 1 and 2) and Pairwise-Link-LDA. Only estimates 
of nonzero elements of B are shown. For elements of B which are zero, all 
estimates are less than 0.0022 across all simulated datasets for the LMV (Al¬ 
gorithms 1 and 2) and Pairwise-Link-LDA. The estimates from Algorithm 
1 are very close to the true values in B. There is slight underestimation in 
Algorithm 2 and greater variation in results across different datasets due 
to the subsampling of documents and zeros. It is also interesting to note 
that estimates of the blockmodel in Pairwise-Link-LDA are much smaller 
or about half of the true values in B. As Pairwise-Link-LDA does not dis- 
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Table SI 

Mean and standard deviation (after i) of the posterior mean estimates of nonzero 
blockmodel elements over 20 simulated datasets. 


True value LMV LMVS PLLDA 


B[ LI] 

0.3 

0.298 

± 

0.005 

0.279 

± 

0.007 

0.146 

± 

0.004 

B[ 2,2] 

0.3 

0.299 

± 

0.004 

0.278 

± 

0.006 

0.147 

± 

0.003 

£[3,3] 

0.3 

0.297 

± 

0.005 

0.278 

± 

0.007 

0.147 

± 

0.005 

£[4,4] 

0.2 

0.200 

± 

0.004 

0.187 

± 

0.004 

0.098 

± 

0.003 

£[5,5] 

0.2 

0.203 

± 

0.004 

0.187 

± 

0.006 

0.099 

± 

0.003 

£[6,6] 

0.2 

0.201 

± 

0.004 

0.187 

± 

0.003 

0.097 

± 

0.003 

£[2,5] 

0.05 

0.049 

± 

0.001 

0.049 

± 

0.003 

0.024 

± 

0.001 

£[3,4] 

0.05 

0.048 

± 

0.001 

0.049 

± 

0.002 

0.023 

± 

0.001 

£[4,1] 

0.04 

0.039 

± 

0.001 

0.039 

± 

0.002 

0.019 

± 

0.001 

£[5,2] 

0.04 

0.039 

± 

0.001 

0.039 

± 

0.002 

0.019 

± 

0.001 

£[1,6] 

0.03 

0.030 

± 

0.001 

0.030 

± 

0.001 

0.014 

± 

0.001 

£[6,3] 

0.03 

0.029 

± 

0.001 

0.029 

± 

0.001 

0.014 

± 

0.001 

£[1,3] 

0.02 

0.020 

± 

0.001 

0.020 

± 

0.001 

0.010 

± 

0.000 

£[4,6] 

0.02 

0.020 

± 

0.001 

0.021 

± 

0.001 

0.010 

± 

0.000 

£[6,1] 

0.02 

0.020 

± 

0.001 

0.020 

± 

0.001 

0.010 

± 

0.001 


tinguish between documents of similar topics with different visibilities, and 
the visibilities were simulated from Uniform[0,1], there is an averaging ef¬ 
fect where elements of B are scaled by 0.5. This simulated example shows 
that the LMV has the potential to improve citation predictions in real-world 
scenarios. While running the LMV in batch mode is time consuming, sub¬ 
sampling can reduce the runtime and memory requirements while producing 
competitive estimations and predictions. 
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